# install.packages('pacman')
# pacman loads multiple packages in one step (installs if needed, as well)
pacman::p_load(
  tidyverse,
  bskyr, # to interact with BlueSky API
  ellmer # to interact with AI model
)
# https://jbgruber.github.io/atrrr/articles/Basic_Usage.html
# atrrr package is another option for pulling BlueSky data
Introduction and setup
Why packages?
One of my favorite aspects of coding in R is the package: a bundled collection of functions, datasets, and documentation designed to extend the functionality of R. You can install and load a package by running just a few lines of R code.
Packages enable programmers to make their code more concise (you can run many lines of code with just one function from a package) and easily share useful code they’ve written.
Maybe most importantly, this also makes R much more accessible to beginners, as they can leverage packages’ functions to execute more advanced tasks without writing all of the nuanced code to do so themselves.
Cool, right?
Today: bskyr & ellmer
This example utilizes functions from two exciting new (to me) packages:
bskyr, for interacting with the BlueSky API, and ellmer, from the tidyverse, for interacting with LLMs.
While there is some setup associated with the initial use, once that setup is complete, all of this can be done in RStudio itself.
You can also use bskyr to post to BlueSky from R!
For those interested, atrrr appears to be another solid option for interacting with BlueSky from R. I only chose bskyr because I came across it first.
So, let’s load those packages!
Initial external setup
Authenticating BlueSky access
The first time you connect to BlueSky via R, you will need to authenticate the access to your account, as I’ve done below, inserting your respective handle and app password. You can find more detailed instructions on this process online (such as where to go to get an app password on BlueSky), but it took me only minutes, since I already had a BlueSky account.
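If you are not sure whether you have already done this on your machine, you can check for saved credentials before running any setup. Here is a minimal base R sketch. The variable names `BLUESKY_APP_USER` and `BLUESKY_APP_PASS` are an assumption on my part about what bskyr writes to .Renviron, so confirm them in the package documentation; `env_is_set()` is a hypothetical helper, not a bskyr function.

```r
# Returns TRUE if an environment variable is set to a non-empty value
env_is_set <- function(name) {
  nzchar(Sys.getenv(name, unset = ""))
}

# ASSUMPTION: these variable names are not confirmed by this post;
# check the bskyr documentation for the names it actually uses
bluesky_vars <- c("BLUESKY_APP_USER", "BLUESKY_APP_PASS")

# Named logical vector: one TRUE/FALSE per variable
credentials_found <- vapply(bluesky_vars, env_is_set, logical(1))
credentials_found
```

If both come back TRUE after restarting R, the one-time setup has most likely been done already.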
The code below, utilizing set_bluesky_user() from bskyr, is set up to save your credentials in your R environment, so that moving forward you will not need to run those lines.
# Authenticate BlueSky account access
# Only need to run below the first time, then saves to Renviron
set_bluesky_user("username.bsky.social", # INSERT YOUR HANDLE HERE
  install = TRUE,
  r_env = file.path(Sys.getenv("HOME"), ".Renviron")
)
set_bluesky_pass("XXXX-XXXX-XXXX-XXXX", # INSERT YOUR APP PASSWORD HERE
  install = TRUE,
  r_env = file.path(Sys.getenv("HOME"), ".Renviron")
)
Setting up your LLM
For this simple example, I leverage a model called gemma3 through Ollama, which I can query through ellmer functions in R. Note that you can also use online LLMs, if you have an account; see ellmer documentation. For Ollama, the initial setup included:
- Downloading and installing the Ollama software, which allows you to run open-source LLMs locally - on your own computer! Thus, this is free, does not require account setup, and keeps your data on your own machine, which is more private.
- Selecting a model for Ollama to run. Note that you may have to consider memory limitations of your machine when selecting a model. My choice for this example, gemma3, excels at summarization and is under 4 gigabytes.
- Installing the model. I find that the easiest way to install models is through your computer’s command prompt with, for example, ollama run gemma3, which will download, install, and initialize the model. Then, you can even “speak” with it through the command prompt, like a very simple AI interface!
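Before moving on, it can help to confirm from R that Ollama is actually running. The sketch below assumes Ollama is listening on its default local address (http://localhost:11434) and uses its /api/tags endpoint, which lists the installed models. `ollama_is_running()` is a hypothetical helper I made up for illustration; it is not part of ellmer.

```r
# Hypothetical helper: TRUE if an Ollama server answers at base_url
ollama_is_running <- function(base_url = "http://localhost:11434") {
  # Keep the wait short if nothing is listening; restore the option on exit
  old <- options(timeout = 5)
  on.exit(options(old), add = TRUE)

  endpoint <- paste0(base_url, "/api/tags") # lists locally installed models
  tryCatch(
    {
      suppressWarnings(readLines(endpoint, warn = FALSE))
      TRUE
    },
    error = function(e) FALSE
  )
}

ollama_is_running()
```

A FALSE here usually just means the Ollama app is not open yet, or that it is running on a non-default address.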
Your R code
Now that our BlueSky access and all of our software are ready, on to the fun part: running R code to pull data from BlueSky and get a summary from the LLM!
If at any point you have additional questions about the packages, their arguments or capabilities, reference their online documentation. This example is not an extensive guide, but a brief illustration.
Querying BlueSky
First, to get our data from BlueSky, we will use bs_search_posts().
# query to search in posts; if you have a space, it's like an AND
topic <- "public health"
# i.e., this will grab posts with 'public health' but also those with
# 'public' and 'health' separately in the post; both are required but not
# right next to each other
# there is likely a way to search more incisively but we'll leave that for
# another day
# how many posts do you want to pull?
n_posts_pull <- 20
raw_pull <- bs_search_posts(
  query = topic,
  limit = n_posts_pull,
  sort = "latest" # get the most recent posts
)
raw_pull
# A tibble: 20 × 13
uri cid author record reply_count repost_count like_count
<chr> <chr> <list> <list> <int> <int> <int>
1 at://did… bafy… <named list> <named list> 0 0 0
2 at://did… bafy… <named list> <named list> 0 0 0
3 at://did… bafy… <named list> <named list> 0 0 1
4 at://did… bafy… <named list> <named list> 0 0 5
5 at://did… bafy… <named list> <named list> 0 0 0
6 at://did… bafy… <named list> <named list> 0 0 0
7 at://did… bafy… <named list> <named list> 1 0 0
8 at://did… bafy… <named list> <named list> 0 0 0
9 at://did… bafy… <named list> <named list> 0 0 0
10 at://did… bafy… <named list> <named list> 0 0 1
11 at://did… bafy… <named list> <named list> 0 0 0
12 at://did… bafy… <named list> <named list> 1 0 0
13 at://did… bafy… <named list> <named list> 0 0 0
14 at://did… bafy… <named list> <named list> 1 0 0
15 at://did… bafy… <named list> <named list> 1 0 0
16 at://did… bafy… <named list> <named list> 0 0 0
17 at://did… bafy… <named list> <named list> 0 0 0
18 at://did… bafy… <named list> <named list> 0 0 0
19 at://did… bafy… <named list> <named list> 0 0 0
20 at://did… bafy… <named list> <named list> 0 0 0
# ℹ 6 more variables: quote_count <int>, indexed_at <chr>, viewer <list>,
# labels <list>, embed <list>, cursor <chr>
As you can see, this resulted in a table of 20 observations, as we wanted, but with 13 variables, and some of the variables are lists; it is not clear where the content of the post is located until further examination reveals it is within one of these list-variables. So, I used modify() from purrr to grab the text object from within the record list-variable.
# pull out the posts' text content
text <- modify(raw_pull$record, function(x) x[["text"]]) %>%
  str_remove_all(pattern = "\n|\r") # strip line breaks
text
[1] "No shade on experts, but honestly, I'm not an expert and this is effing obvious to me.PEOPLE WILL DIE.The CDC and the NIH and etc are excellent examples of good things that a government can do. Public health research and programs benefit *everybody*."
[2] "Working on a film about two January 6ers who get pardoned and then go to work for DOGE dismantling the public health system. It’s called “A Day of Love in the Age of Cholera.”"
[3] "Health Care in Canada needs defending against Conservative Premier’s who are dismantling the public system to favour private ownership."
[4] "The scene at NIH this morning is cruel and heartbreaking. People who dedicated their lives to protecting Americans’ public health are lining up just to be told if they’re getting fired or not. This will not make America Great or Healthy—it will delay new cures & treatment for patients in need."
[5] "And we can't fund public health or social services. We only have money for billionaire bros and rocket guys."
[6] "A coalition of state attorneys general sued the Trump administration on Tuesday over its decision to cut $11 billion in federal funds that go toward COVID-19 initiatives and various public health projects across the country. Attorneys general from 23 states filed the suit in federal court in RI"
[7] "🛢️ The coal, oil, and gas industries and their allies have spent billions of dollars trying to dupe the public into supporting products that are destroying our health and our environment."
[8] "The NKF were excited to join the LGA/ADPH Annual Public Health Conference 2025 virtually today! We look forward to three days of insightful sessions on \"Tackling health inequalities together\""
[9] "If Democrats were to believe that mass transit expansion is good (and good for public health and the environment), then having the Maryland Purple Line subject to years of lawsuits and delays (many on \"environmental\" grounds) seems counter to the end goal of the project. Just do the project!"
[10] "“This will lead to worse health outcomes, greater risks to the US public, and will contribute to the decline in US life expectancy...\" And decreases our preparedness and resilience to an epidemic or pandemic. We didn't do well fully staffed, we'll do worse now..."
[11] "massive public health and science cuts around the usa todaywake upunite"
[12] "💶What if we told you pesticide companies owe you billions?For #PesticideActionWeek, we handed a €23 billion bill to @croplifeeu.bsky.social, the pesticide lobby protecting corporations like Bayer & BASF, for the damage they cause to public health & the environment👇 www.youtube.com/watch?v=6zJ5..."
[13] "...have a look at the recommendations of the SARS Commission from two decades ago. It was unambiguous about N95s then. It's just that the people running the show in public health have more ego than brains, and decided to show us they knew better.COVID is the Dunning-Kruger pandemic."
[14] "That was not what the experts said. It was just public health leaders shouting over the experts and winging it because they want to control everything. For the actual competent expert position, read the OHS standards. This one's the 2018 version, but the 2011 version was very similar.Also..."
[15] "📢 SGIM Expresses Concern Over HHS ReorganizationSGIM is deeply concerned about the Department of Health and Human Services’ (HHS) decision to lay off 10,000 staff and restructure key programs—changes that could negatively impact public health, primary care, and health services research."
[16] "“This will lead to worse health outcomes, greater risks to the US public, and will contribute to the decline in US life expectancy...\" Yes."
[17] "There's also a great book called spillover which in its introduction basically predicted the covid-19 outbreak. It talks about how important public health is and how pandemics work. The demon in the freezer is also another interesting book, but it and The Hot Zone have creative license in them."
[18] "23 states, DC sue Trump administration over billions in lost public health funding www.cnn.com/2025/04/01/h..."
[19] "massive public health and science cuts around the us today wake up"
[20] "The demolition of the US continues apace:“‘The cuts today at CDC targeted programs that address all aspects of American lives,’ a source at the CDC tells WIRED. ‘This will lead to worse health outcomes, greater risks to the US public, and will contribute to the decline in US life expectancy…’”"
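As a side note, the same extraction can be written with map_chr() from purrr, which returns a plain character vector directly. This is a sketch on made-up data, not the real pull: `mock_records` is a hypothetical stand-in shaped like the record list-variable. I also swap line breaks for a space here rather than deleting them, which avoids run-together sentences like "me.PEOPLE" in the output above.

```r
library(purrr)

# Mock of the `record` list-variable: each element is a named list
mock_records <- list(
  list(text = "First post\nwith a line break", createdAt = "2025-04-01T12:00:00Z"),
  list(text = "Second post", createdAt = "2025-04-01T12:05:00Z"),
  list(createdAt = "2025-04-01T12:10:00Z") # hypothetical record with no text
)

# map_chr() always returns a character vector;
# .default fills in NA where the "text" field is missing
post_text <- map_chr(mock_records, "text", .default = NA_character_)

# Replace runs of line breaks with a single space
post_text <- gsub("[\r\n]+", " ", post_text)
post_text
```

The `.default` argument is the main design choice: a record without a text field becomes NA instead of stopping the whole pipeline with an error.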
But hey, what if we want to analyze the posts’ data with some of these other useful columns we were given?
Tidying the tibble
Let’s create a tibble with the counts of likes and whatnot. Plus, maybe we’ll filter to the five posts with the most likes?
# how many posts do we want to keep to summarize?
n_posts_summarize <- 5
# add other cols, filter as desired
tidied_text <- tibble(
  text = text, # that actual content
  when = modify(raw_pull$record, function(x) x[["createdAt"]])
) %>%
  bind_cols(raw_pull %>% select(matches("count"))) %>% # add count columns
  arrange(-like_count) %>% # start with most likes
  head(n_posts_summarize) # only keep the top n we specified
tidied_text
# A tibble: 5 × 6
text when reply_count repost_count like_count quote_count
<chr> <lis> <int> <int> <int> <int>
1 "The scene at NIH this … <chr> 0 0 5 0
2 "Health Care in Canada … <chr> 0 0 1 0
3 "“This will lead to wor… <chr> 0 0 1 1
4 "No shade on experts, b… <chr> 0 0 0 0
5 "Working on a film abou… <chr> 0 0 0 0
Now, of the 20 most recent BlueSky posts with the words ‘public’ and ‘health’, we’ve kept the five with the most likes.
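For reference, dplyr also offers slice_max(), which handles the sort-and-keep step in a single call. Here is a small sketch on made-up data; `mock_posts` is hypothetical and only mimics the text and like_count columns.

```r
library(dplyr)

# Made-up posts with like counts, including a tie at 1 like
mock_posts <- tibble(
  text = c("post a", "post b", "post c", "post d"),
  like_count = c(0L, 5L, 1L, 1L)
)

# Keep the n most-liked posts; with_ties = FALSE returns exactly n rows
top_posts <- mock_posts %>%
  slice_max(like_count, n = 2, with_ties = FALSE)
top_posts
```

Setting with_ties = FALSE matters for this kind of data, because most recent posts have 0 likes and a tie at the cutoff would otherwise return more rows than requested.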
Summarizing with your LLM
Let’s ask for our summary with chat_ollama() from ellmer!
To guide the model towards the desired outcome, provide a system_prompt argument. If you do not, your results may be more inconsistently formatted, even if you include similar instructions in the later chat.
# https://ellmer.tidyverse.org/reference/chat_ollama.html
chat <- chat_ollama(
  model = "gemma3",
  system_prompt = paste0(
    "You are an assistant to a very ",
    "busy professional, to whom you ",
    "must provide a concise summary ",
    "of the following information."
  )
)
chat$chat(paste0(
  "Synthesize a short summary paragraph (3-4 sentences total)",
  " on these posts from BlueSky, from various users. ", # trailing space separates the posts
  tidied_text$text %>% paste0(collapse = ". ")
))
Here’s a concise summary for your professional:
Current events at the NIH highlight a critical issue: the potential dismantling
of vital public health infrastructure, jeopardizing American health outcomes
and preparedness for future crises. Multiple voices express concern over cuts
to organizations like the CDC and NIH, warning of negative consequences
including delayed cures and reduced resilience. The situation underscores the
importance of sustained government investment in public health research and
programs for the benefit of all citizens.
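One thing to keep in mind: the prompt above joins the posts with ". ", which can blur where one post ends and the next begins. Below is a sketch of one possible variation that numbers each post instead. `build_prompt()` and its default instruction text are hypothetical; they are not part of ellmer, and I have not compared the resulting summaries.

```r
# Hypothetical helper: builds one prompt string from a character vector of posts
build_prompt <- function(posts,
                         instruction = "Summarize these BlueSky posts in 3-4 sentences.") {
  # Number each post so the boundaries between posts are unambiguous
  numbered <- paste0("[", seq_along(posts), "] ", posts)
  paste0(instruction, "\n\n", paste(numbered, collapse = "\n"))
}

prompt <- build_prompt(c("Public health matters.", "Funding was cut today."))
cat(prompt)
```

The resulting string could then be passed to the chat call in place of the pasted text.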
Putting it all together
So, in a couple of steps, we did what we wanted to do! However, what if I want to run those steps several times, concisely, with different queries?
Creating a function
We’ll create a function called summarize_bluesky_topic! Additionally, this keeps the intermediate variables out of our environment, reducing clutter.
# clear environment - all objects
rm(list = ls())
# create our function - that goes into environment for later use
summarize_bluesky_topic <- function(topic = "news", # default query
                                    n_posts_pull = 20, # initial grab of posts
                                    n_posts_summarize = 5, # summarizes top liked
                                    include_posts = FALSE) { # default to just summary
  # pull from bluesky
  raw_pull <- bs_search_posts(
    query = topic,
    limit = n_posts_pull,
    sort = "latest"
  )
  # pull out the posts' text content
  text <- modify(raw_pull$record, function(x) x[["text"]]) %>%
    str_remove_all(pattern = "\n|\r")
  # add other cols, filter as desired
  tidied_text <- tibble(
    text = text,
    when = modify(raw_pull$record, function(x) x[["createdAt"]])
  ) %>%
    bind_cols(raw_pull %>% select(matches("count"))) %>%
    arrange(-like_count) %>%
    head(n_posts_summarize)
  # set up the chat with the local model
  chat <- chat_ollama(
    model = "gemma3",
    system_prompt = paste0(
      "You are an assistant to a very ",
      "busy professional, to whom you ",
      "must provide a concise summary ",
      "of the following information."
    )
  )
  # ask the model for the summary
  chat$chat(paste0(
    "Synthesize a short summary paragraph (3-4 sentences total)",
    " on these posts from BlueSky, from various users. ",
    tidied_text$text %>% paste0(collapse = ". ")
  ))
  # optionally return the posts themselves, for comparison
  if (include_posts) {
    tidied_text$text
  }
}
Now that we have this function set up, we can run that whole process in one line!
# Run with defaults - recent skeets with "news"
default_run <- summarize_bluesky_topic()
Here’s a concise summary for your professional:
Multiple BlueSky posts highlight concerning news: a critical crash in
Cambridgeshire, a significant security concern regarding North Korea’s threat
level, and exciting news about Gordon Ramsay attending a theatre production.
These updates indicate a mix of serious events and a notable cultural event –
please monitor developments in both areas.
Vibe check
This is great, but how do we know the LLM is summarizing the posts accurately? We should compare the output to the original posts, of course! I included the include_posts argument for just this purpose. Note that a second run of the function pulls entirely new data, so we should not expect it to match the summary we just saw, especially since there are so many posts with ‘news’ on BlueSky.
# Run with defaults again, this time also returning the posts
default_run <- summarize_bluesky_topic(include_posts = TRUE)
Here’s a concise summary for your professional:
This BlueSky thread highlights a diverse range of interests, from admiring Star
Wars creator Peach Momoko to appreciating historical figures like those at
Bletchley Park. Several users expressed concerns about media bias and economic
trends, including a successful Nvidia investment and speculation surrounding
retail sales. It appears a mix of thoughtful observations and passionate
opinions are circulating.
# now, we need to print the resultant object, a character vector of the posts
default_run
[1] "I admire Peach Momoko's work so much, I was nervous to chat with her about Star Wars! Read about Peach and her history as a fan today on SW dot com: www.starwars.com/news/creator..."
[2] "Having worked at CERN for 5 years I particularly enjoyed this one:home.cern/news/news/ph..."
[3] "I keep suggesting but everyone pushes back, if Trump can sue the MSM for telling the truth, why can't Americans sue them for all the right wing lies they allow on their \"news\" outlets?The six companies that own 90% of our media were forced to yield when the voting machine companies went after them."
[4] "It is sad to see retail sales drying up. Most likely this is dip buyers running low on cash. But the good news is if March 31st was the bottom They ran out of money at exactly the right time! The day I bought 300 more shares of Nvidia at a 44% discount! 22x forward earnings and a 0.56 PEG!"
[5] "www.bbc.com/news/article...Amazing, considering I had never heard of Bletchley Park until having just read The Rose Code by Kate Quinn. These men and women did astonishing work during WWII. I’m in awe …."
OK, that sort of makes sense, based on their content and the lack of additional context. However, is it drawing connections between unrelated news stories?
Don’t forget - this pulls social media posts off the internet. So, we are starting with completely unvalidated data, and running it through a complex probabilistic model. Even if the model works well, the information could be completely untrue before it’s input. Additionally, the model itself could hallucinate or lack vital context to interpret the information!
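To make this kind of checking easier, one option is to have a function return the summary and the posts it was built from as a single object, so they can be stored and compared later. The sketch below is only an illustration of that design: `summarize_with_sources()`, `fake_fetch()`, and `fake_summarize()` are all hypothetical stand-ins, so it runs without BlueSky or a model. In practice, the two stand-ins would be replaced by the search and chat steps from the function above.

```r
# Hypothetical wrapper: fetch_fn and summarize_fn stand in for the
# BlueSky search and LLM steps, passed in so the sketch runs without either
summarize_with_sources <- function(topic, fetch_fn, summarize_fn) {
  posts <- fetch_fn(topic)
  summary <- summarize_fn(posts)
  # Return both pieces from the same run so they can be compared directly
  list(topic = topic, summary = summary, posts = posts)
}

# Stand-in functions for illustration only (no API or model calls)
fake_fetch <- function(topic) paste("A post about", topic, c("one", "two"))
fake_summarize <- function(posts) paste(length(posts), "posts were summarized.")

result <- summarize_with_sources("news", fake_fetch, fake_summarize)
result$summary
result$posts
```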
Additional query examples
summarize_bluesky_topic("rstats")
Okay, here’s a concise summary for your professional:
“Several BlueSky users are highlighting valuable resources for data
visualization using R, particularly focusing on the #30DayChartChallenge and
the ‘Positron’ IDE. Sara Altman is sharing a detailed walkthrough on
integrating LLMs with Shiny apps via the Ellmer package, while Greg Dubrow is
leveraging Danmarks Statistik data for his challenge. Users are also discussing
desired R features and the evolving pains & joys of using Positron.”
summarize_bluesky_topic("Washington Spirit")
Okay, here’s a concise summary for you:
“Now that America is experiencing a shift in leadership, Europe is positioning
itself to lead the global conversation. Recent sporting events – including
impressive build-up plays for the Washington Spirit and attendance figures at
D.C. United and the D.C. Defenders – highlight a broader focus on sporting
initiatives across the Atlantic.”
summarize_bluesky_topic("public health")
Okay, here’s a concise summary for your professional:
“Multiple concerning reports are emerging regarding the impact of public health
cuts, particularly at the NIH, which threatens critical research and patient
care. Simultaneously, there’s a growing concern about the dismantling of public
health systems globally, exemplified by attacks on healthcare in Canada and
potential liability for pesticide companies. These developments collectively
underscore a weakening of public health infrastructure with significant
consequences for both domestic and international well-being.”
summarize_bluesky_topic("democracy")
Okay, here’s a concise summary for you:
“Senator Booker is demonstrating crucial resistance against a concerning agenda
involving Trump, Musk, and the GOP. The conversation highlights the critical
role of legal professionals in safeguarding democracy, warning against those
who would undermine it. There’s a call to action to mobilize support for
Senator Booker and demand accountability from the current administration,
emphasizing the need to find bipartisan support for democratic principles.”
It seems like more specific topics do better, as the information can be more cohesively combined.
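If you find yourself running many topics, the separate calls can be collapsed into one step with map() from purrr. In this sketch, `summarize_stub()` is a hypothetical stand-in so the example runs on its own; swap in summarize_bluesky_topic to query BlueSky and the model for real.

```r
library(purrr)

# Stand-in for summarize_bluesky_topic() so this sketch runs on its own
summarize_stub <- function(topic) paste("Summary for:", topic)

topics <- c("rstats", "public health", "democracy")

# set_names() labels each result with the topic that produced it
summaries <- map(set_names(topics), summarize_stub)
names(summaries)
summaries[["rstats"]]
```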
Increasing the number of posts
More data will typically take longer to run. Let’s get rid of the Washington Spirit query, as that one was also pulling political posts.
summarize_bluesky_topic("rstats",
  n_posts_pull = 100,
  n_posts_summarize = 20
)
Here’s a concise summary of the BlueSky posts:
The R community is buzzing with activity around the #30DayChartChallenge, with
a focus on data visualization and analysis. Several users are showcasing
creative waffle plots and treemaps related to prompts like “Fractions” and
“Comparing Proportions.” There’s also increased discussion around best
practices for choosing R packages, emphasizing trust and security, alongside
announcements for events like the R/Medicine 2025 conference and the launch of
a new R package.
summarize_bluesky_topic("public health",
  n_posts_pull = 100,
  n_posts_summarize = 20
)
Here’s a concise summary of the BlueSky posts, aiming for 3-4 sentences:
A growing coalition of 23 states and Washington, D.C. are suing the Trump
administration over the rollback of $12 billion in public health funding. These
cuts threaten vital programs like Title X and impact responses to pandemics,
climate change, and overall public health initiatives. The move is seen as
detrimental, potentially leading to worsened health outcomes and delays in
medical research and treatment. The situation is generating significant concern
among state attorneys general and highlights the critical importance of
sustained investment in public health.
summarize_bluesky_topic("democracy",
  n_posts_pull = 100,
  n_posts_summarize = 20
)
Here’s a concise summary of the BlueSky posts, aiming for 3-4 sentences:
A significant wave of BlueSky posts expresses concern over the current
political climate, particularly regarding potential attacks on democratic
institutions. Numerous users, including Senator Cory Booker, are actively
engaging in sustained protests and public displays of resistance, primarily
focused on countering what they perceive as a dangerous trend towards
authoritarianism. There’s a strong emphasis on mobilizing public support and
demanding accountability from political figures, with calls for solidarity
across various causes, including human rights and democratic values. Many users
highlight the fragility of democratic norms and the need for continuous
vigilance and action.
Hmm, do you think these outputs have improved, since they are based on more information?
Conclusion
Well, what do you think? Was our function effective?
Could these techniques be useful in the real world, beyond a fun side project? What concerns do you have about the limitations and ethical implications?
What other ideas do you have for analyzing BlueSky data, beyond simple summarization with an LLM? I wonder if we could track sentiment surrounding a queried subject over time using sentiment analysis.
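To make that last idea a bit more concrete, here is a toy sketch of scoring posts and averaging the scores by day. Everything in it is made up for illustration: `toy_lexicon` is a tiny hand-written word list, `mock_posts` is fake data, and `score_post()` is a hypothetical helper. A real analysis would need an established sentiment lexicon or model, and far more posts.

```r
library(dplyr)

# ASSUMPTION: a toy, hand-made lexicon for illustration only
toy_lexicon <- c(good = 1, great = 1, excited = 1, bad = -1, cruel = -1, worse = -1)

# Score one post: sum of lexicon values for the words it contains
score_post <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(toy_lexicon[words], na.rm = TRUE) # words not in the lexicon count as 0
}

# Mock posts with timestamps in the same ISO 8601 style as createdAt
mock_posts <- tibble(
  text = c("Great news, excited!", "This is bad and cruel.", "It will get worse."),
  when = c("2025-04-01T09:00:00Z", "2025-04-01T15:00:00Z", "2025-04-02T10:00:00Z")
)

# Average the per-post scores within each calendar day
daily_sentiment <- mock_posts %>%
  mutate(
    day = as.Date(substr(when, 1, 10)),
    score = vapply(text, score_post, numeric(1), USE.NAMES = FALSE)
  ) %>%
  group_by(day) %>%
  summarise(mean_score = mean(score), .groups = "drop")
daily_sentiment
```

Word-counting like this misses sarcasm and context entirely, so treat it as a starting point for thinking about the problem rather than a method to rely on.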